1. Introduction


Analytical models for hardware failure have been extensively investigated in the literature,
along with performability issues [1]. Although the time for different components to fail is usually
assumed to be exponentially distributed, time-dependent failure rates and graceful degradation
have been considered [2]. Automatic availability evaluation, assuming a Markov model, is
discussed in [3]. A job/task flow based model is described in [4], where failure occurrence is assumed
to be a linear function of the service requests from a job/task flow. As shown in [5], this linear
assumption may result in underestimating the effect of the workload, especially when the load is
high. Research in software reliability growth models is summarized in [6]; run-time
software reliability modeling is discussed in [7].

Although many authors have addressed the modeling issue and have significantly advanced
the state of the art, none have addressed the issue of how to identify the model structure. Further,
very few of either the hardware or the software models have been validated with real data.
Exceptions are the joint hardware/software model discussed in [8] and a measurement-based model of
workload dependent failures discussed in [5]. Both, however, describe only the external behavior
of the system and thus do not provide insight into component-level behavior.

In this paper we build a semi-Markov model to describe the resource-usage/error/recovery
process in a large mainframe system. The model is based on one year of low-level error and
performance data collected on a production IBM 3081 system running under the MVS operating
system. The 3081 system consisted of dual processors with two multiplexed channel sets. Both the
normal and erroneous behavior of the system are modeled. A reward function, based on the service
rate and the error rate in each state, is defined in order to estimate the performability of the
system and to depict the cost of different error types and recovery procedures. Two key
contributions of this paper are:

(1) A method for identifying a model structure for the resource-usage/error/recovery process is
introduced, and the resulting model is validated against real data.

(2) It is shown that a semi-Markov model may represent system behavior better than a
Markov model.

2.  Resource  Usage  Characterization 

In  this  section  we  identify  a  state-transition  model  to  describe  the  variation  in  system 
activity.  System  activity  was  characterized  by  measuring  a  number  of  resource  usage  parameters. 
A  statistical  clustering  technique  was  then  employed  to  identify  a  small  number  of  representative 
states. 

The  resource  usage  data  were  collected  by  sampling,  at  pre-determined  intervals,  a  number  of 
resource  usage  meters,  using  the  IBM  MVS/370  system  Resource  Measurement  Facility  (RMF).  A 
sample-time  of  500  milliseconds  was  used  in  this  study.  Four  different  resource  usage  measures 
were  used: 

CPU  -  fraction  of  the  measured  interval  for  which  the  CPU  is  executing  instructions. 

CHB - fraction of the measured interval for which the channel was busy and the CPU was in
the wait state (this parameter is commonly used to measure the degree of contention in
a system)

SIO  -  number  of  successful  Start  I/O  and  Resume  I/O  instructions  issued  to  the  channel 

DASD  -  number  of  requests  serviced  on  the  direct  access  storage  devices 

At any interval of time the measured workload is represented by a point in a 4-dimensional space
(CPU, CHB, SIO, DASD). Statistical cluster analysis is used to divide the workload into similar
classes according to a pre-defined criterion. This allows us to concisely describe the dynamics of
system behavior and extract a structure that already exists in the workload data.1

Each cluster (defined by its centroid) is then used to depict a system state, and a
state-transition diagram (consisting of inter-cluster transition probabilities and cluster sojourn times) is
developed. A k-means clustering algorithm [10] is used for cluster analysis. The algorithm
partitions an N-dimensional population into k sets on the basis of a sample. The k non-empty clusters
sought, $C_1, C_2, \ldots, C_k$, are such that the sum of the squares of the Euclidean distances of the cluster
members from their centroids, $\sum_{j=1}^{k} \sum_{x_i \in C_j} \| x_i - \bar{x}_j \|^2$, is minimized, where $\bar{x}_j$ is the centroid of cluster $C_j$.

1 Similar clustering techniques are also used for workload characterization in [9].
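The clustering criterion above can be illustrated with a minimal sketch. It uses a Lloyd-style batch iteration, which is an assumption and may differ from the k-means variant of [10]; the function names and the (CPU, CHB) sample values are hypothetical and are not taken from the measured data.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Partition points (tuples of floats) into k clusters by Lloyd-style iteration."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the sample
    assignment = None
    for _ in range(iterations):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so the centroids are final
        assignment = new_assignment
        # Move each centroid to the mean of its members; leave it alone if empty.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment

def within_cluster_ss(points, centroids, assignment):
    """Sum of squared distances of the members from their centroids (the minimized quantity)."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[j]))
        for p, j in zip(points, assignment)
    )

# Hypothetical (CPU, CHB) observations: three heavily loaded, three lightly loaded.
samples = [(0.95, 0.02), (0.97, 0.01), (0.96, 0.03),
           (0.10, 0.11), (0.09, 0.10), (0.12, 0.12)]
centroids, labels = kmeans(samples, k=2)
objective = within_cluster_ss(samples, centroids, labels)
```

Each resulting centroid plays the role of one workload state; the label sequence over time is what the state-transition diagram is built from.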


Two types of workload clusters were formed. In the first case CPU and CHB were selected to
be the workload variables. This combination was found to best describe the CPU-bound load
(nearly 60% of the observations have a CPU usage greater than 0.72). In the second case the
clusters were formed considering SIO and DASD as the workload variables. This combination was
found to best describe the I/O workload. In this paper, only the results for CPU-bound load
clusters are presented. Details of I/O activity can be found in [11]. Table 1 shows the results of the
clustering operation. The table shows that about 36% of the time the CPU was heavily loaded
(0.96) and almost 76% of the time the CPU load was above 0.5. Since the measured system
consisted of two processors, we may say that 76% of the time at least one of the processors is busy. A
state-transition diagram of CPU-bound load activity is shown in Figure 1. Note that a null state,
W0, has been incorporated to represent the state of the system during the non-measured period.
The time spent in the null state was assumed to be zero. The transition probability from state i to
state j, $p_{ij}$, was estimated from the measured data using:

$$p_{ij} = \frac{\text{observed no. of transitions from state } i \text{ to state } j}{\text{observed no. of transitions from state } i}$$
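The estimator above amounts to counting transitions. The following sketch assumes the measured data have already been reduced to the sequence of states visited; the function name and the short sequence are hypothetical.

```python
from collections import Counter, defaultdict

def transition_probabilities(visits):
    """Estimate p_ij as (# transitions i -> j) / (# transitions out of i)."""
    counts = defaultdict(Counter)
    for i, j in zip(visits, visits[1:]):
        counts[i][j] += 1
    return {
        i: {j: n / sum(row.values()) for j, n in row.items()}
        for i, row in counts.items()
    }

# Hypothetical sequence of visited workload states.
visits = ["W8", "W7", "W8", "W6", "W8", "W7", "W6", "W8"]
p = transition_probabilities(visits)
# W8 is left three times: twice to W7 and once to W6.
```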


Cluster   % of     Mean      Mean      Std dev   Std dev
id        obs      of CPU    of CHB    of CPU    of CHB

W1         7.44    0.0981    0.1072    0.0462    0.0436
W2         0.50    0.1126    0.5525    0.0433    0.0669
W3         2.73    0.1547    0.2801    0.0647    0.0755
W4        12.41    0.3105    0.1637    0.0550    0.0459
W5         0.74    0.3639    0.3819    0.0365    0.1923
W6        17.12    0.5416    0.1287    0.0560    0.0511
W7        22.58    0.7207    0.0848    0.0576    0.0301
W8        36.48    0.9612    0.0168    0.0362    0.0143

R2 of CPU = 0.9724
R2 of CHB = 0.8095
overall R2 = 0.9604

R2 : the square of the correlation coefficient

Table 1. Characteristics of CPU-bound workload clusters

Figure 1. State-transition diagram of CPU-bound load

In  the  next  section  the  characterization  of  the  errors  and  the  recovery  process  is  discussed. 
The  appropriate  error  and  recovery  states  are  identified  for  subsequent  use  in  developing  an  overall 
model. 

3.  Error  and  Recovery  Characterization 

The IBM system has built-in error detection facilities and there are many provisions for
hardware and software initiated recovery through retry and redundancy. The error and recovery
data are automatically logged by the operating system as the errors occur. On the occurrence of an
error the operating system creates a time-stamped record describing the error, the state of the
machine at the time of the error, and the result of the hardware and/or software attempts to
recover from the error. Details of this logging mechanism are described in [12]. Due to the manner
in which errors are detected and reported in a computer system, it is possible that a single fault
may manifest itself as more than one error, depending on the activity at the time of the error. The
different manifestations may not all be identical [13]. The system recovery usually treats these
errors as isolated incidents. Thus, the raw data can be biased by error records relating to the same
problem. In order to address this problem, two levels of data reduction were performed. First, a
coalescing algorithm described in [5] was used to analyze the data and merge observations which
occur in rapid succession and relate to the same problem. Next, a reduction technique described in
[13] was used to automatically group records most likely to have a common cause. By using these
two methods, the errors were classified into five different classes. These classes are called error
events, since they may contain more than one error, and are explained below:

CPU : Errors which affect the normal operation of the CPU; may originate in the CPU, in the
main memory, or in a channel

CHAN : Channel errors (the great majority are recovered)

DASD : Disk errors, both recoverable (by data correction, hardware instruction retry or software
instruction retry) and non-recoverable

SWE : Software incidents due to invalid supervisor calls, program checks and other software
exception conditions

MULT : Multiple errors affecting more than one type of component (i.e., involving more than
one of the above)
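The first reduction step, merging records that occur in rapid succession, can be sketched as follows. This is only an illustration under stated assumptions: the actual coalescing algorithm is the one described in [5], whereas this sketch merges on time proximity alone, and the window length, record layout and function name are hypothetical.

```python
def coalesce(records, window):
    """Merge time-sorted (timestamp_seconds, error_type) records into error events.

    A record joins the current event when it follows the event's last record
    by at most `window` seconds (a hypothetical criterion).
    """
    events = []
    for t, kind in records:
        if events and t - events[-1]["last"] <= window:
            events[-1]["last"] = t
            events[-1]["types"].add(kind)
            events[-1]["count"] += 1
        else:
            events.append({"first": t, "last": t, "types": {kind}, "count": 1})
    for ev in events:
        # An event involving more than one component type is labeled MULT.
        ev["label"] = "MULT" if len(ev["types"]) > 1 else next(iter(ev["types"]))
        # Duration: time between the first and the last detected error.
        ev["duration"] = ev["last"] - ev["first"]
    return events

# Hypothetical log records.
records = [(0, "DASD"), (20, "DASD"), (45, "SWE"),
           (4000, "SWE"), (4010, "SWE"), (9000, "CHAN")]
events = coalesce(records, window=60)
```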

Table 2 lists the frequencies of different types of errors. Notice that about 17% of errors are
classified as multiple errors (MULT). A MULT error is mostly due to a single cause, but the fault
has non-identical manifestations, provoked by different types of system activity. Since the
manifestations are non-identical, recovery may be complex and hence can (as will be seen later) impose
considerable overhead on the system.


Type of error   Frequency   Percent

CPU                     2      0.04
CHAN                  119      2.23
MULT                  924     17.33
SWE                  1923     36.07
DASD                 2364     44.33
total                5332    100.00

Table 2. Frequency of errors


Figure  2.  Flow  chart  of  recovery  processes 

The recovery procedures were divided into four categories based on recovery cost, which was
measured in terms of the system overhead required to handle an error. The lowest level (hardware
recovery) involves the use of an error correction code (ECC) or hardware instruction retry and has
minimal overhead. If hardware recovery is not possible (or unsuccessful), software controlled
recovery is invoked. This could be simple, e.g., terminating the current program or task in control,
or complex, e.g., invoking specially designed recovery routine(s) to handle the problem. The third
level, alternative (ALT), involves transferring the tasks to functioning processor(s) when one of
the processors experiences an unrecoverable error. If no on-line recovery is possible, the system is
brought down for off-line (OFFL) repair. Figure 2 shows a flow chart of the recovery process. The
time spent in each recovery state was taken to be constant, since each recovery type except OFFL
requires almost constant overhead.2

2 Hardware recovery involves hardware instruction retry or ECC correction. The maximum number of retries is
predetermined. Each CPU has a 26-nanosecond machine cycle time and the disk seek time is about 23 milliseconds. We
estimate a worst case hardware recovery cost of 0.3 seconds, i.e., incorporating twenty I/O retries: ten through the original
I/O path and another ten through an alternative I/O path if the alternative is available. This, of course, overestimates the
cost of hardware retry used for the CPU errors. Similarly, the worst-case software recovery time was estimated to be 1
second. The ALT state was not evaluated since it did not occur in the data. For OFFL the time was calculated to be 1 hour,
based on our experience and on discussions with maintenance engineers.


4.  Resource-Usage/Error/Recovery  Model 


In this section we combine the separate workload, error and recovery models developed into a
single model, shown in Figure 3. The null state W0 is not shown in this diagram. The model has
three different classes of states: normal operation states ($S_N$), error states ($S_E$), and recovery states
($S_R$). Under normal conditions, the system makes transitions from one workload state to another.
The occurrence of an error results in a transition to one of the error states. The system then goes
into one or more recovery modes after which, with a high probability, it returns to one of the
"good" workload states3. The state transition diagram shows that nearly 98.3% of hardware
recovery requests and 99.7% of software recovery requests are successful. Thus the error


Figure  3.  State-transition  diagram  of  resource-usage/error/recovery  model 


3 Note that the transition probabilities from to W3 are different from those in Figure 1, where error states were not
considered in computing the transition probabilities.


detection,  fault  isolation  and  on-line  recovery  mechanisms  allow  the  measured  system  to  handle  an 
error  efficiently  and  effectively.  In  less  than  1%  of  the  cases  is  the  system  not  able  to  recover. 

One  state  which  needs  further  elaboration  is  the  MULT  state.  Recall  that  a  MULT  state 
denotes  a  multiple  error  event  affecting  more  than  one  component  type.  Figure  4  shows  the  state- 
transition  diagram  of  a  MULT  error  event,  i.e.,  the  transition  diagram  given  a  MULT  error.  The 
model  quantifies  the  interactions  between  the  different  components  in  a  multiple  error  occurrence. 
From  the  diagram,  it  is  seen  that  in  about  65%  of  the  cases  a  multiple  error  starts  as  a  software 
error  (SWE)  and  in  32%  of  the  cases  it  starts  as  a  disk  error  (DASD).  Given  that  a  disk  error  has 
occurred,  there  is  nearly  a  30%  chance  that  a  software  error  will  follow.  It  is  also  interesting  to 
note  that  there  is  a  64%  chance  that  one  software  error  will  be  followed  by  another  different 
software  error. 




Figure  4.  State-transition  diagram  of  a  given  multiple  error  (MULT) 

4.1. Distributions for Workload and Error States


Table 3 shows the characteristics of both the workload and error states in terms of their
waiting times. An examination of the mean and standard deviation of the waiting times indicates that
not all waiting times are simple exponentials. This is particularly pronounced for the error states.


            Workload                              Error
State   Mean           Standard      State   Mean           Standard
        waiting time   deviation             waiting time   deviation

W1        1263.71       1384.20      CPU         *              *
W2         289.65          1.19      CHAN        5.08          18.31
W3         698.79        913.30      SWE        41.35         103.35
W4        1203.05       1130.28      DASD      120.86         223.89
W5         613.74        421.73      MULT      293.28         262.84
W6        1380.86       1588.76
W7        1071.31       1004.46
W8        1612.72       2576.35

* statistically insignificant

Table 3. Characteristics of waiting time (seconds)
in workload and error states
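A quick way to see the departure from the exponential assumption is the coefficient of variation (standard deviation divided by mean), which equals 1 for an exponential distribution. The sketch below applies this check to a few rows of Table 3; the helper name is hypothetical, and the check is offered as an illustration rather than as the test used in this paper.

```python
# (mean, standard deviation) of waiting time in seconds, copied from Table 3.
waiting = {
    "W2": (289.65, 1.19),
    "W5": (613.74, 421.73),
    "CHAN": (5.08, 18.31),
    "SWE": (41.35, 103.35),
    "DASD": (120.86, 223.89),
    "MULT": (293.28, 262.84),
}

def coefficient_of_variation(mean, std):
    """Ratio std/mean; equal to 1 for an exponential distribution."""
    return std / mean

cv = {state: coefficient_of_variation(m, s) for state, (m, s) in waiting.items()}

# States whose coefficient of variation is far from 1 (arbitrary 25% band).
non_exponential = sorted(s for s, value in cv.items() if abs(value - 1.0) > 0.25)
```

Values well above 1 (as for CHAN, SWE and DASD) suggest a heavier-tailed mixture, while values well below 1 suggest a more regular, stage-like waiting time.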


Figure 5. Waiting and holding time densities (panel (b): holding time density from W8 to SWE state)

Figures 5(a) and 5(b) show the density of the waiting time for the W8 state and the specific holding
time to the SWE error state. The waiting time for state i is the time that the process spends in
state i before making a transition to any other state. The holding time for a transition from state i
to state j is the time that the process spends in state i before making a transition to state j [14].
This is the same as the distribution of a one-step transition from state i to j. The distributions in
Figure 5 are fitted to phase-type exponential density functions [15] and tested by using the
Kolmogorov-Smirnov test at a 0.01 significance level.
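The goodness-of-fit step can be sketched as below. The sketch computes the one-sample Kolmogorov-Smirnov statistic against any candidate distribution function and compares it with the usual large-sample critical value 1.63/sqrt(n) for the 0.01 level. A plain exponential stands in for the fitted phase-type function, the samples are synthetic, and all names are hypothetical.

```python
import math

def ks_statistic(samples, cdf):
    """Largest gap between the empirical distribution function and cdf."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical distribution jumps from i/n to (i+1)/n at x.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def ks_critical_value_001(n):
    """Large-sample approximation of the critical value at the 0.01 level."""
    return 1.63 / math.sqrt(n)

def exponential_cdf(mean):
    return lambda x: 1.0 - math.exp(-x / mean)

# Synthetic samples placed at evenly spaced quantiles of an exponential
# distribution (the mean is used only to set the scale).
n = 50
mean = 41.35
samples = [-mean * math.log(1.0 - (i + 0.5) / n) for i in range(n)]

d_matching = ks_statistic(samples, exponential_cdf(mean))
d_mismatched = ks_statistic(samples, exponential_cdf(10 * mean))
critical = ks_critical_value_001(n)
fit_accepted = d_matching < critical
wrong_fit_rejected = d_mismatched > critical
```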

4.2.  Error  Duration  Distributions 

Recall that an error event can involve more than one error, since errors frequently occur
in bursts. During an error burst the system goes into an error -> recovery cycle until the error
condition disappears4. In such cases we measure the duration of an error event as the time
difference between the first detected error and the last detected error caused by the same event.
The duration of an error event can be used to measure the severity of the error. Since each
recovery type takes approximately a constant amount of time, the loss of work can be
approximated by the error rate in this period. In Section 6, we use this information to build a reward
model for the system. Figure 6 shows examples of error duration densities for two different types
of errors, SWE and MULT.

In summary, we have developed a state-transition model which describes the normal and
error behavior of the system. A key characteristic of the model is that the waiting time in some of
the workload states and in most error states cannot be modeled as simple exponentials.
Furthermore, the holding times from a given workload state to different error states are dependent on the
destinations. Thus, the overall system is modeled as a complex irreducible semi-Markov process.


4 This is typical of many systems (e.g., see [8]). The final recovery usually occurs because the conditions which
triggered the error disappear due to a change in system activity.


Figure 6. Error duration densities (panel (a): waiting time density for SWE; horizontal axis: duration in minutes)

5.  Model  Behavior  Analysis 

Now  that  we  have  an  overall  model,  we  show  the  usage  of  this  model  to  predict  key  system 
characteristics.  The  mean  time  between  different  types  of  errors  is  evaluated  along  with  model 
characteristics,  such  as  the  occupancy  probabilities  of  key  normal  and  error  states. 

5.1.  General  Characteristics 

By  solving  the  semi-Markov  model,  we  find  that  the  modeled  system  made  a  transition  every 
9  minutes  and  8  seconds,  on  average.  In  comparing  this  with  the  mean  time  between  errors 
(MTBE)  listed  in  Table  4,  it  is  clear  that  most  often  the  transitions  are  from  one  normal  state  to 
another.  The  table  also  shows  that  a  DASD  error  was  detected  almost  every  52  minutes  (0.87 
hours)  while  a  software  error  was  detected  every  1  hour  and  45  minutes.  Most  of  the  DASD  errors 
(95%)  were  recovered  through  hardware  recovery  (i.e.,  hardware  instruction  retry  or  ECC),  thus 
resulting  in  negligible  overhead.  Table  4  also  lists  the  mean  recurrence  time  for  recovery  states. 

            Error states                           Recovery states
CPU    CHAN     SWE     DASD    MULT       HWR     SWR     ALT    OFFL

 -     26.88    1.75    0.87    4.62       0.62    2.57     -     651.37

Table 4. Mean recurrence time (hours) of error and recovery states

Thus, the on-line hardware recovery routine is invoked once every 0.62 hours, while the software
recovery occurs every 2.57 hours. By using an estimated time for each hardware recovery and
comparing the results with the recovery overhead, we estimate that the cost of hardware recovery
is only 0.02% of total computation time. The mean recurrence time of the alternative recovery
routine was not estimated due to lack of data, i.e., this event seldom occurred.

5.2.  Summary  Model  Probabilities 

Since the process is modeled as an irreducible semi-Markov process, we can evaluate the
following steady state parameters [14]:

(1) occupancy probability ($\Phi_j$) - the probability that the process occupies state j

(2) conditional entrance probability ($\pi_j$) - given that the process is now making a transition, the
probability that the transition is to state j

(3) entrance rate ($e_j$) - the rate at which the process enters state j at any time
instance ($e_j = \pi_j / \bar{\tau}$, where $\bar{\tau}$ is the mean time between transitions)

(4) mean recurrence time ($\theta_j$) - mean time between successive entries into state j
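The four parameters listed above can be computed from the embedded transition matrix and the mean waiting times. The sketch below uses the standard semi-Markov relations (occupancy proportional to entrance probability times mean waiting time, and mean recurrence time as the reciprocal of the entrance rate), which are assumed here rather than quoted from [14]. The four-state chain, its probabilities and its waiting times are hypothetical.

```python
def stationary_distribution(P, tol=1e-13, max_iter=100000):
    """Solve pi = pi P by damped power iteration (the damping handles periodic chains)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        stepped = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        new = [(a + b) / 2.0 for a, b in zip(pi, stepped)]
        if max(abs(a - b) for a, b in zip(pi, new)) < tol:
            return new
        pi = new
    return pi

def semi_markov_measures(P, tau):
    """Return (pi, phi, e, theta) for embedded matrix P and mean waiting times tau."""
    pi = stationary_distribution(P)                     # conditional entrance probabilities
    tau_bar = sum(p * t for p, t in zip(pi, tau))       # mean time between transitions
    phi = [p * t / tau_bar for p, t in zip(pi, tau)]    # occupancy probabilities
    e = [p / tau_bar for p in pi]                       # entrance rates
    theta = [1.0 / rate for rate in e]                  # mean recurrence times
    return pi, phi, e, theta

# Hypothetical model: two workload states, one error state, one recovery state.
states = ["W_low", "W_high", "ERR", "REC"]
P = [
    [0.0, 0.9, 0.1, 0.0],
    [0.8, 0.0, 0.2, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0, 0.0],
]
tau = [600.0, 1200.0, 60.0, 1.0]  # mean waiting times in seconds

pi, phi, e, theta = semi_markov_measures(P, tau)
```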


The model characteristics are summarized in Table 5. A dash in this table indicates a
negligible value (statistically insignificant). Table 5(a) shows the normal system behavior. For
example, given that a transition occurs, the system is most likely to go to states W7 or W8. This is
also reflected in the respective entrance rates and occupancy probabilities for these states.
From the occupancy probabilities ($\Phi$) we see that almost 34% of the time the CPU load is as high as
0.96 (W8); 39% of the time the CPU is moderately loaded (W6 + W7). Table 5(b) shows the error
behavior of the system. The table shows that about 30% of the transitions are to an error state
(obtained by summing the $\pi$'s for all the error states). The DASD errors have the highest

                                         Normal state
Measure   W0        W1        W2        W3        W4        W5        W6        W7        W8

Phi       0         0.0623    0.0008    0.0136    0.1238    0.0034    0.1639    0.2255    0.3398
pi        0.0237    0.0264    0.0014    0.0104    0.0339    0.0047    0.0633    0.1123    0.1127
e *       0.00003   0.00003   -         0.00002   0.0001    0.00001   0.00012   0.00021   0.00021
theta **  3.78      3.62      102.36    14.32     2.63      3148      243       142       142

(a). Normal states


                         Error state                              Recovery state
Measure   CPU    CHAN      SWE       DASD      MULT      HWR       SWR       ALT    OFFL

Phi       -      0.00003   0.0066    0.0383    0.0179    0.00022   0.00011   -      -
pi        -      0.0033    0.0830    0.1692    0.0322    0.2379    0.0372    -      0.00023
e *       -      0.00001   0.00016   0.00032   0.00006   0.00043   0.00011   -      -
theta **  -      26.88     1.73      0.87      4.62      0.62      2.57      -      651.37

(b). Error and recovery states

* per second    ** in hours


Table 5. Summary of model characteristics

entrance  probability.  For  the  data  shown  in  the  table  it  can  be  estimated  that  an  error  is  detected, 
on  the  average,  every  30  minutes.  Of  course,  over  98%  of  these  errors  incur  negligible  overhead. 
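One way to reproduce the 30-minute figure is to add the entrance rates of the error states, using the mean recurrence times of Table 4. This is a sketch of a plausible calculation and is not claimed to be the exact procedure used here; the variable names are hypothetical.

```python
# Mean recurrence times of the error states in hours (Table 4; CPU is negligible).
recurrence_hours = {"CHAN": 26.88, "SWE": 1.75, "DASD": 0.87, "MULT": 4.62}

# The rate of entering any error state is the sum of the individual entrance rates.
total_rate_per_hour = sum(1.0 / hours for hours in recurrence_hours.values())
mean_minutes_between_errors = 60.0 / total_rate_per_hour
```

The result is slightly above 30 minutes and is dominated by the DASD and SWE terms.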

An  interesting  characteristic  of  the  multiple  errors  is  also  seen  in  Table  5(b).  Although  the 
entrance  probability  of  a  MULT  error  is  lower  than  that  for  SWE,  its  occupancy  probability  is 
higher.  This  is  due  to  the  fact  that  a  MULT  error  event  has  a  longer  mean  waiting  time  as  compared 
to  SWE  error  events  (293  seconds  versus  41  seconds). 
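The argument can be checked numerically, assuming the standard semi-Markov relation that the occupancy probability of a state is proportional to its entrance probability multiplied by its mean waiting time. The values are copied from Tables 3 and 5(b); the variable names are hypothetical.

```python
# Conditional entrance probabilities (Table 5(b)) and mean waiting times in seconds (Table 3).
entrance_probability = {"SWE": 0.0830, "MULT": 0.0322}
mean_waiting_time = {"SWE": 41.35, "MULT": 293.28}

# Unnormalized occupancy weights: entrance probability times mean waiting time.
weight = {s: entrance_probability[s] * mean_waiting_time[s] for s in entrance_probability}
predicted_ratio = weight["MULT"] / weight["SWE"]

# Ratio of the occupancy probabilities reported in Table 5(b).
reported_ratio = 0.0179 / 0.0066
```

Both ratios are close to 2.7, so the longer waiting time of MULT more than offsets its lower entrance probability.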


5.3. Model Validation

Even though our model is developed from real data, it needs to be validated, since the model
identification process, e.g., the workload clustering, allows us to only approximate the real system
behavior. In order to evaluate the validity of the model, three measures evaluated via the model
were compared with direct calculations from the actual data. Table 6 shows the comparison of the
occupancy probabilities for key normal states (occupancy probability greater than 0.1) and for one
key error state (DASD).5 The table also shows the comparison for the mean recurrence time ($\theta$) of
the SWE error event and for its standard deviation (Std). It can be seen that the errors in all the predicted
values are around 3 percent or less, indicating that the proposed semi-Markov model is an accurate
estimator of the real system behavior. This also provides support for the model structure
identification method employed in this paper.

5.4.  Markov  Versus  Semi-Markov 

This section investigates the significance of using a semi-Markov model to describe the overall
resource-usage/error/recovery process. It has been argued that since errors only occur infrequently
(i.e., $\lambda$ is small), a Markov model may well approximate the real behavior. Clearly, if only the
first moments, e.g., MTBE, are of interest, the Markov model provides adequate information. If the
distributions (e.g., the time to error distribution) or higher moments are of interest, the Markov
model may be inadequate. Thus, although our evidence shows that the semi-Markov process is a
better model, i.e., more closely approximates the data from the measured system, it is reasonable to
ask what deviations occur if a Markov process is assumed. In order to answer this question we use
a Markov model to describe our system and compare the results with those obtained through the
more realistic semi-Markov model.


                          Phi                                theta    Std
          W4       W6       W7       W8       DASD      SWE      SWE

Model     0.1258   0.1639   0.2255   0.3398   0.0383    1.75     2.18
Actual    0.1259   0.1634   0.2311   0.3452   0.0386    1.72     2.11
delta     0.0008   0.0031   0.0242   0.0156   0.0156    0.022    0.033

delta : the absolute error, | Model - Actual |

Table 6. Comparison of Phi, theta and standard deviation


5 The 'Actual' values are calculated from observed data. For example, the actual occupancy probability of state i is

$$\frac{\text{total time that the system was observed to be in state } i}{\text{length of the observation period}}$$




We compared the two by calculating two steady state parameters. The first is the
complementary distribution of the time to error (referred to as R(t)) for different error types. The second is
the standard deviation of the time to error. The results for the SWE state are shown in Table 7. It is clear that
the Markov model overestimates R(t) in the early life (i.e., for short times to error) and
underestimates R(t) for long times to error. The standard deviation is also
considerably underestimated by the Markov model, thus casting doubt on the validity of using MTBE
estimates themselves.

In  summary,  our  measurements  show  that  using  a  Markov  model  is  optimistic  in  the  short 
run  and  pessimistic  in  the  long  run.  The  underestimation  of  the  standard  deviation  of  R(t)  is  also  a 
serious  problem  because  it  calls  into  question  the  representativeness  of  the  MTBE  estimates. 
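The optimistic-then-pessimistic behavior can be read directly from the values reported in Table 7. The sketch below computes the difference between the two models at each tabulated time and locates the first time at which the Markov estimate falls below the semi-Markov one; the variable names are hypothetical.

```python
# R(t) for the SWE state, copied from Table 7 (time in minutes).
times = [0, 7.5, 15.0, 22.5, 30.0, 37.5, 45.0, 52.5]
semi_markov = [0.99, 0.61, 0.44, 0.32, 0.24, 0.18, 0.14, 0.10]
markov = [1.00, 0.71, 0.50, 0.35, 0.25, 0.17, 0.12, 0.09]

# Positive difference: the Markov model is optimistic; negative: pessimistic.
difference = [round(m - s, 2) for m, s in zip(markov, semi_markov)]

# First tabulated time at which the Markov estimate drops below the semi-Markov one.
crossover = next(t for t, d in zip(times, difference) if d < 0)
largest_overestimate = max(difference)
```

The largest overestimate occurs at 7.5 minutes, and the sign changes between 30 and 37.5 minutes.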


6.  Performability  Analysis 

In this section we use the workload/error/recovery model to evaluate the performability of
the system. Reward functions are used to depict the performance degradation due to errors and
also due to different types of recovery procedures. Since the recovery overhead for each recovery
state in the modeled system is approximately constant, the total recovery overhead for each error
event, and thus the reward, depends on the error rate during that event. Thus, the higher the error rate
during an error event, the higher the recovery overhead and hence the lower the reward. On this
basis we define the reward rate $r_i$ (per unit time) for each state i of the model as follows:


                                        R(t)                                    Std (mins)
Time (mins)    0      7.5    15.0   22.5   30.0   37.5   45.0   52.5

semi-Markov    0.99   0.61   0.44   0.32   0.24   0.18   0.14   0.10     28.63
Markov         1.00   0.71   0.50   0.35   0.25   0.17   0.12   0.09     25.13

Table 7. Comparison between Markov and semi-Markov

where $s_i$ and $e_i$ are the service rate and the error rate in state i, respectively. Thus one unit of
reward is given for each unit of time when the process stays in the good states $S_N$. The penalty
paid depends on the number of errors generated by an error event. With an increasing number of
errors the penalty per unit time increases and, accordingly, the reward rate decreases. Zero reward
is assigned to recovery states. Based on this proposal, reward rates for the error states are as
shown in Table 8.

The reward rate of the modeled system at time t is a random variable X(t), which is defined
as

$$X(t) = \begin{cases} 1 & \text{if the process is in state } i \in S_N \\ r_i & \text{if the process is in state } i \in S_E \\ 0 & \text{otherwise.} \end{cases}$$

Therefore the expected reward rate E[X(t)] can be evaluated from $E[X(t)] = \sum_i p_i(t)\, r_i$. The
cumulative reward by time t is $Y(t) = \int_0^t X(\sigma)\, d\sigma$, and the expected cumulative reward is given by
[16]:

$$E[Y(t)] = E\left[ \int_0^t X(\sigma)\, d\sigma \right] = \sum_i r_i \int_0^t p_i(\sigma)\, d\sigma ,$$

where $p_i(t)$ is the probability of being in state i at time t. In order to evaluate $p_i(t)$ and hence
other measures, we convert the semi-Markov process into a Markov process using the method of
stages [15, 17]. The state probability vector $P(t) = (p_1(t), p_2(t), \ldots)$ of the Markov process can be


State   DASD     SWE      CHAN     MULT

r_i     0.5708   0.2736   0.9946   0.2777

Table 8. Reward rates for error states

derived from $P(t) = P(0)\, e^{Qt}$, where $P(0) = (1, 0, \ldots, 0)$ and Q is the transition rate matrix of the
Markov process [14].
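The transient solution and the expected reward rate can be illustrated numerically. The sketch below evaluates P(0) exp(Qt) by uniformization, a standard technique that is an assumption here and not a description of how SHARPE computes the result. The four-state rate matrix (rates per minute) is hypothetical, and the error-state reward simply reuses the SWE value from Table 8.

```python
import math

def transient_probabilities(Q, p0, t, terms=200):
    """Evaluate P(t) = P(0) exp(Qt) by uniformization.

    The fixed number of terms is adequate only while (largest exit rate) * t
    stays well below `terms`.
    """
    n = len(Q)
    lam = max(-Q[i][i] for i in range(n)) or 1.0
    # M = I + Q / lam is the transition matrix of the uniformized chain.
    M = [[(1.0 if i == j else 0.0) + Q[i][j] / lam for j in range(n)] for i in range(n)]
    weight = math.exp(-lam * t)  # Poisson probability of zero jumps
    v = list(p0)
    result = [weight * x for x in v]
    for k in range(1, terms):
        v = [sum(v[i] * M[i][j] for i in range(n)) for j in range(n)]
        weight *= lam * t / k    # Poisson probability of k jumps
        result = [r + weight * x for r, x in zip(result, v)]
    return result

def expected_reward_rate(Q, p0, rewards, t):
    """E[X(t)] = sum over states of r_i * p_i(t)."""
    return sum(r * p for r, p in zip(rewards, transient_probabilities(Q, p0, t)))

# Hypothetical states: normal, error, recovery, OFFL (absorbing).
Q = [
    [-0.02, 0.02, 0.0, 0.0],
    [0.0, -0.5, 0.5, 0.0],
    [1.98, 0.0, -2.0, 0.02],
    [0.0, 0.0, 0.0, 0.0],
]
rewards = [1.0, 0.2736, 0.0, 0.0]
p0 = [1.0, 0.0, 0.0, 0.0]

p30 = transient_probabilities(Q, p0, 30.0)
reward_at_start = expected_reward_rate(Q, p0, rewards, 0.0)
reward_at_30 = expected_reward_rate(Q, p0, rewards, 30.0)
```

Making a further state absorbing, as in the cases discussed in this section, corresponds to zeroing that state's row of Q before calling the same functions.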

In order to study the performance degradation due to different types of errors, the irreducible
semi-Markov process was transformed by considering the OFFL (off-line repair) state as the
absorbing state. The expected reward calculated with this assumption indeed reflects the true
performability until system failure. Next, for evaluating the impact of different error events, we first
observe that often these events have significant error duration time (e.g., the MULT state has a mean
error duration of 5 minutes with a standard deviation of 4 minutes). Since the majority of jobs
last less than a few minutes, as far as a user program is concerned, an entry into a long duration
error state is similar to entering an absorption state with $r_i > 0$. Thus, the impact of the MULT state can
be evaluated by making it into an absorption state with $r_i > 0$. A similar analysis can be performed
for other error states.6

In our analysis, we first make the OFFL state the absorbing state. This gives the expected
performance until an off-line failure. Then we evaluate three other cases, giving four cases in all:

a) OFFL case (OFFL),

b) MULT and OFFL case (MULT),

c) SWE, MULT and OFFL case (SWE), and

d) DASD, MULT and OFFL case (DASD).

Case (a) gives the overall performability of the system assuming that OFFL (off-line repair) is the
absorbing state, i.e., the impact of all other error events is taken into account. This gives both the
transient and steady state performability of the system. Next we assume in case (b) that both
MULT and OFFL are absorbing. The difference between (a) and (b) approximates the expected
performance loss due to possible entry into a long duration MULT state. Similarly, the difference
between (a) and (c) provides an estimate of loss of performance due to entry into an SWE state. In
the long term, of course, each will reach a steady state value. The above analyses were performed
on the resulting Markov reward model of the system using SHARPE (the Symbolic Hierarchical
Automated Reliability and Performance Evaluator)7 developed at Duke University.

The curves of Figure 7 show the expected reward rate at time t, E[X(t)], for these four cases.
The evaluation of the cumulative reward, E[Y(t)], is discussed in [11]. In practical terms the
differences provide an estimate of the loss in reward due to various error types, assuming that the
jobs are initiated when the system is fully operational. As an example, in Figure 7 we find
that the SWE event degrades system effectiveness considerably more than the DASD event. This is
because the reward rate of the SWE error is lower than that of the DASD error, even though the error probability


Figure 7. Expected reward rate, E[X(t)] (panel (b); horizontal axis: minutes)


7 SHARPE is a modeling tool. It provides several model types ranging from reliability block diagrams to complex
Markov models [17].

of the DASD event is higher than that of the SWE event.


7.  Conclusion 


In  this  study,  we  have  proposed  a  methodology  to  construct  a  model  of  resource  usage,  error 
and  recovery  in  a  computer  system,  using  real  data  from  a  production  system.  The  semi-Markov 
model  obtained  is  capable  of  reflecting  both  the  normal  and  error  behavior  of  our  measured  system. 
The  errors  are  classified  into  various  types,  based  on  the  components  involved.  Both  hardware  and 
software  errors  are  considered,  and  the  interaction  between  the  system  components  (hardware  and 
software)  is  reflected  in  a  multiple  error  model.  The  proposed  reward  measure  allows  us  to  predict 
the  performability  of  the  system  based  on  the  service  and  error  rates.  It  is  suggested  that  other 
production  systems  be  similarly  analyzed  so  that  a  body  of  realistic  data  on  computer  error  and 
recovery  models  is  available. 